{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "82898ade-d48e-4b08-be5a-89bb60ee8888",
   "metadata": {},
   "source": [
    "# 基于vLLM的Yuan 2.0推理服务部署\n",
    "\n",
    "## 1. 配置vLLM环境\n",
    "环境要求：torch2.1.2 cuda12.1\n",
    "\n",
    "vLLM环境配置主要分为以下两步，拉取Yuan-2.0项目，以及安装vllm运行环境\n",
    "\n",
    "注：由于pip版本 vllm目前还不支持Yuan 2.0，因此需要编译安装\n",
    "\n",
    "### Step 1. 拉取Yuan-2.0项目\n",
    "\n",
    "```bash\n",
    "# 拉取项目\n",
    "git clone https://github.com/IEIT-Yuan/Yuan-2.0.git\n",
    "```\n",
    "\n",
    "### Step 2. 安装vLLM运行环境\n",
    "\n",
    "```bash\n",
    "# 进入vLLM项目\n",
    "cd Yuan-2.0/3rdparty/vllm\n",
    "\n",
    "# 安装依赖\n",
    "pip install -r requirements.txt\n",
    "\n",
    "# 安装setuptools\n",
    "# vllm对setuptools的版本有要求, 参考 https://github.com/vllm-project/vllm/issues/4961\n",
    "vim pyproject.toml # 修改setuptools == 69.5.1\n",
    "pip install setuptools == 69.5.1\n",
    "\n",
    "# 安装vllm\n",
    "pip install -e .\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "051681f7-6efc-428a-ab64-fa38a6059092",
   "metadata": {},
   "source": [
    "## 2. Yuan2.0-2B模型基于vLLM的推理和部署\n",
    "\n",
    "以下是如何使用vLLM推理框架对Yuan2.0-2B模型进行推理和部署的示例\n",
    "\n",
    "### Step 1. 模型下载\n",
    "\n",
    "使用 modelscope 中的 snapshot_download 函数下载模型，第一个参数为模型名称，参数 cache_dir 为模型的下载路径。\n",
    "\n",
    "这里可以先进入autodl平台，初始化机器对应区域的的文件存储，文件存储路径为'/root/autodl-fs'。\n",
    "该存储中的文件不会随着机器的关闭而丢失，这样可以避免模型二次下载。\n",
    "\n",
    "![autodl-fs](images/autodl-fs.png)\n",
    "\n",
    "然后运行下面代码，执行模型下载。模型大小为 4.5GB，下载大概需要 5 分钟。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "d1395959-f325-4d92-a833-792f431df85a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from modelscope import snapshot_download\n",
    "model_dir = snapshot_download('YuanLLM/Yuan2-2B-Mars-hf', cache_dir='/root/autodl-fs')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c62d8046-6e27-41bf-8ea9-4342df4c6d27",
   "metadata": {},
   "source": [
    "### Step 2. 基于vllm推理Yuan2.0-2B\n",
    "\n",
    "基于vllm推理Yuan2.0-2B首先需要加载模型，然后进行推理\n",
    "\n",
    "#### 1. 加载模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d4f54db7-90ef-4b86-ba62-98fd36051d98",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 06-22 17:21:04 llm_engine.py:73] Initializing an LLM engine with config: model='/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf', tokenizer='/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n",
      "You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 06-22 17:21:11 llm_engine.py:231] # GPU blocks: 1627, # CPU blocks: 780\n",
      "INFO 06-22 17:21:13 model_runner.py:412] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
      "INFO 06-22 17:21:13 model_runner.py:416] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.\n",
      "INFO 06-22 17:21:20 model_runner.py:458] Graph capturing finished in 7 secs.\n"
     ]
    }
   ],
   "source": [
    "from vllm import LLM, SamplingParams\n",
    "import time\n",
    "\n",
    "# 配置参数\n",
    "sampling_params = SamplingParams(max_tokens=300, temperature=1, top_p=0, top_k=1, min_p=0.0, length_penalty=1.0, repetition_penalty=1.0, stop=\"<eod>\", )\n",
    "\n",
    "# 加载模型\n",
    "llm = LLM(model=\"/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf\", trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "650450d8-99aa-4fd7-bc2b-d59d133d424e",
   "metadata": {},
   "source": [
    "#### 2. 推理\n",
    "推理支持单个prompt和多个prompt\n",
    "\n",
    "##### Option 1. 单个prompt推理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9fb32302-95e0-4207-ae1d-25b08ab772a0",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.13it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt: 给我一个python打印helloword的代码<sep>\n",
      "Generated text:  以下是一个简单的Python代码，用于打印字符串\"hello world\"：\n",
      "```python\n",
      "print(\"hello world\")\n",
      "```\n",
      "\n",
      "inference_time: 0.47487425804138184\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "prompts = [\"给我一个python打印helloword的代码<sep>\"]\n",
    "\n",
    "start_time = time.time()\n",
    "outputs = llm.generate(prompts, sampling_params)\n",
    "end_time = time.time()\n",
    "\n",
    "for output in outputs:\n",
    "    prompt = output.prompt\n",
    "    generated_text = output.outputs[0].text\n",
    "    print(\"Prompt:\", prompt)\n",
    "    print(\"Generated text:\", generated_text)\n",
    "    print()\n",
    "\n",
    "print(\"inference_time:\", (end_time - start_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbe4945e-d958-4d6e-b82a-8d09854cd14b",
   "metadata": {},
   "source": [
    "##### Option 2. 多个prompt推理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8ec65a71-94e5-4907-b12d-a2703ed6180d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 2/2 [00:00<00:00,  3.27it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt: 给我一个python打印helloword的代码<sep>\n",
      "Generated text:  以下是一个简单的Python代码，用于打印字符串\"hello world\"：\n",
      "```python\n",
      "print(\"hello world\")\n",
      "```\n",
      "\n",
      "Prompt: 给我一个c++打印helloword的代码<sep>\n",
      "Generated text:  ```cpp\n",
      "#include <iostream>\n",
      "\n",
      "int main() {\n",
      "    std::cout << \"hello world\";\n",
      "    return \n",
      "}\n",
      "```\n",
      "\n",
      "inference_time: 0.6165323257446289\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "prompts = [\"给我一个python打印helloword的代码<sep>\", \"给我一个c++打印helloword的代码<sep>\"]\n",
    "\n",
    "start_time = time.time()\n",
    "outputs = llm.generate(prompts, sampling_params)\n",
    "end_time = time.time()\n",
    "\n",
    "for output in outputs:\n",
    "    prompt = output.prompt\n",
    "    generated_text = output.outputs[0].text\n",
    "    print(\"Prompt:\", prompt)\n",
    "    print(\"Generated text:\", generated_text)\n",
    "    print()\n",
    "\n",
    "print(\"inference_time:\", (end_time - start_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5eac97d-5602-4df0-9433-ccf5af0a87aa",
   "metadata": {},
   "source": [
    "### Step 3. 基于vllm.entrypoints.api_server部署Yuan2.0-2B\n",
    "基于api_server部署Yuan2.0-2B的步骤包括推理服务的发起和调用\n",
    "\n",
    "#### 1. 服务发起\n",
    "\n",
    "```bash \n",
    "# 请在命令行运行以下命令，不用直接在jupyter中使用!python运行\n",
    "python -m vllm.entrypoints.api_server --model=/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf --trust-remote-code\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcc996fd-980c-4150-ac0c-f9a5f4f3804f",
   "metadata": {},
   "source": [
    "#### 2. 服务调用\n",
    "服务调用有以下两种方式：第一种是通过命令行直接调用；第二种方式是通过运行脚本批量调用。\n",
    "\n",
    "##### Option 1. 基于命令行调用服务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c98d27d2-00de-4926-a6bc-8a29612fcf0f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"text\":[\"给我一个python打印helloword的代码<sep> 以下是一个简单的Python代码，用于打印字符串\\\"hello world\\\"：\\n```python\\nprint(\\\"hello world\\\")\\n```\"]}"
     ]
    }
   ],
   "source": [
    "!curl http://localhost:8000/generate -d '{\"prompt\": \"给我一个python打印helloword的代码<sep>\", \"use_beam_search\": false,  \"n\": 1, \"temperature\": 1, \"top_p\": 0, \"top_k\": 1,  \"max_tokens\":256, \"stop\": \"<eod>\"}'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d76b69f-264a-403b-8f09-36bfd67b115f",
   "metadata": {},
   "source": [
    "##### Option 2. 基于命令脚本调用服务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "32ee6e42-3687-407e-9bc6-92ff4c92147a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'text': ['给我一个python打印helloword的代码<sep> 以下是一个简单的Python代码，用于打印字符串\"hello world\"：\\n```python\\nprint(\"hello world\")\\n```']}\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "import json\n",
    "\n",
    "prompt = \"给我一个python打印helloword的代码<sep>\"\n",
    "raw_json_data = {\n",
    "        \"prompt\": prompt,\n",
    "        \"logprobs\": 1,\n",
    "        \"max_tokens\": 256,\n",
    "        \"temperature\": 1,\n",
    "        \"use_beam_search\": False,\n",
    "        \"top_p\": 0,\n",
    "        \"top_k\": 1,\n",
    "        \"stop\": \"<eod>\",\n",
    "        }\n",
    "json_data = json.dumps(raw_json_data)\n",
    "headers = {\n",
    "        \"Content-Type\": \"application/json\",\n",
    "        }\n",
    "response = requests.post(f'http://localhost:8000/generate',\n",
    "                     data=json_data,\n",
    "                     headers=headers)\n",
    "output = response.text\n",
    "output = json.loads(output)\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02a98466-d0cc-41a5-9c00-e21a624dbb53",
   "metadata": {},
   "source": [
    "### Step 4. 基于vllm.entrypoints.openai.api_server部署Yuan2.0-2B\n",
    "基于openai的api_server部署Yuan2.0-2B的步骤和step 3的步骤类似，发起服务和调用服务的方式如下：\n",
    "\n",
    "#### 1. 服务发起\n",
    "\n",
    "```bash \n",
    "# 请在命令行运行以下命令，不用直接在jupyter中使用!python运行\n",
    "python -m vllm.entrypoints.openai.api_server --model=/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf --trust-remote-code\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06727da0-37e8-4dee-b7df-f1cc90b941de",
   "metadata": {},
   "source": [
    "#### 2. 服务调用\n",
    "服务调用有以下两种方式：第一种是通过命令行直接调用；第二种方式是通过运行脚本批量调用。\n",
    "\n",
    "##### Option 1. 基于命令行调用服务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b8a0da33-f0a4-49e1-b0ee-8ae2816f1f7a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"id\":\"cmpl-5f7c39b38f4048928f0c2c6482174a72\",\"object\":\"text_completion\",\"created\":17034486,\"model\":\"/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf\",\"choices\":[{\"index\":0,\"text\":\" 以下是一个简单的Python代码，用于打印字符串\\\"hello world\\\"：\\n```python\\nprint(\\\"hello world\\\")\\n```\",\"logprobs\":null,\"finish_reason\":\"stop\"}],\"usage\":{\"prompt_tokens\":11,\"total_tokens\":39,\"completion_tokens\":28}}"
     ]
    }
   ],
   "source": [
    "!curl http://localhost:8000/v1/completions -H \"Content-Type: application/json\" -d '{\"model\": \"/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf\", \"prompt\": \"给我一个python打印helloword的代码<sep>\", \"max_tokens\": 300, \"temperature\": 1, \"top_p\": 0, \"top_k\": 1, \"stop\": \"<eod>\"}'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04a2ee8d-f576-4c86-9e33-5d1fecabc70a",
   "metadata": {},
   "source": [
    "##### Option 2. 基于命令脚本调用服务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3b747965-e625-4930-b053-920250905e83",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'id': 'cmpl-783bba4e302b4fe48e7f45a35c9e4c05', 'object': 'text_completion', 'created': 17034591, 'model': '/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf', 'choices': [{'index': 0, 'text': ' 以下是一个简单的Python代码，用于打印字符串\"hello world\"：\\n```python\\nprint(\"hello world\")\\n```', 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 11, 'total_tokens': 39, 'completion_tokens': 28}}\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "import json\n",
    "\n",
    "prompt = \"给我一个python打印helloword的代码<sep>\"\n",
    "raw_json_data = {\n",
    "        \"model\": \"/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf\",\n",
    "        \"prompt\": prompt,\n",
    "        \"max_tokens\": 256,\n",
    "        \"temperature\": 1,\n",
    "        \"use_beam_search\": False,\n",
    "        \"top_p\": 0,\n",
    "        \"top_k\": 1,\n",
    "        \"stop\": \"<eod>\",\n",
    "        }\n",
    "json_data = json.dumps(raw_json_data, ensure_ascii=True)\n",
    "headers = {\n",
    "        \"Content-Type\": \"application/json\",\n",
    "        }\n",
    "response = requests.post(f'http://localhost:8000/v1/completions',\n",
    "                     data=json_data,\n",
    "                     headers=headers)\n",
    "output = response.text\n",
    "output = json.loads(output)\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61af24be-7125-4d04-bf1b-11c607813aa2",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py38",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.19 (default, Mar 20 2024, 19:58:24) \n[GCC 11.2.0]"
  },
  "vscode": {
   "interpreter": {
    "hash": "c6bd51bef5b4ff0d27ffa8191406cdd00b92260c116fd6161fe29291443fd9f8"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
